Advances in Large-Scale RDF Data Management
Authors
Peter Boncz, Minh-Duc Pham (CWI, Amsterdam; e-mail: {P.Boncz,duc}@cwi.nl)
Orri Erling (OpenLink Software, U.K.; e-mail: [email protected])
Abstract
One of the prime goals of the LOD2 project is to improve the performance and scalability of RDF storage solutions so that the increasing amount of Linked Open Data (LOD) can be managed efficiently. Virtuoso was chosen as the basic RDF store for the LOD2 project, and during the project it has been significantly improved by incorporating advanced relational database techniques from MonetDB and Vectorwise, turning it into a compressed column store with vectored execution. This has reduced the performance gap (the “RDF tax”) between Virtuoso’s SQL and SPARQL query performance in a way that still respects the “schema last” nature of RDF. However, lacking schema information, RDF database systems such as Virtuoso still cannot use advanced relational storage optimizations such as table partitioning or clustered indexes, and they have to execute SPARQL queries with many self-joins on a triple table, which leads to more join effort than is needed in SQL systems. In this chapter, we first discuss the new column store techniques applied to Virtuoso and the enhancements in its cluster parallel version, and show its performance using the popular BSBM benchmark at the unsurpassed scale of 150 billion triples. Finally, we describe ongoing work on deriving an “emergent” relational schema from RDF data, which can help to close the performance gap between relational and RDF-based storage solutions.
1.1 General Objectives
One of the objectives of the LOD2 EU project is to boost the performance and scalability of RDF storage solutions so that they can efficiently manage huge datasets (e.g., one trillion RDF triples) of Linked Open Data (LOD). However, it has been ...
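The self-join overhead mentioned in the abstract can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module rather than Virtuoso, and the `triples` and `product` tables, their columns, and the sample data are hypothetical names chosen for illustration only. It is not the authors' implementation; it only shows why reading several properties of one subject from a triple table takes one join per extra property, while a relational table with one column per property needs none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Triple-table layout (hypothetical): each fact is one (subject, predicate, object) row.
cur.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
cur.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("product1", "label", "Widget"),
        ("product1", "price", "10"),
        ("product1", "producer", "Acme"),
        ("product2", "label", "Gadget"),
        ("product2", "price", "25"),
        ("product2", "producer", "Acme"),
    ],
)

# Reading three properties of the same subject requires two self-joins.
triple_rows = cur.execute(
    """
    SELECT t1.o, CAST(t2.o AS INTEGER), t3.o
    FROM triples t1
    JOIN triples t2 ON t2.s = t1.s AND t2.p = 'price'
    JOIN triples t3 ON t3.s = t1.s AND t3.p = 'producer'
    WHERE t1.p = 'label'
    ORDER BY t1.s
    """
).fetchall()

# Relational layout (hypothetical): one row per subject, one column per property.
cur.execute(
    "CREATE TABLE product (id TEXT PRIMARY KEY, label TEXT, price INTEGER, producer TEXT)"
)
cur.executemany(
    "INSERT INTO product VALUES (?, ?, ?, ?)",
    [
        ("product1", "Widget", 10, "Acme"),
        ("product2", "Gadget", 25, "Acme"),
    ],
)

# The same question is answered by a single scan with no joins.
table_rows = cur.execute(
    "SELECT label, price, producer FROM product ORDER BY id"
).fetchall()

print(triple_rows == table_rows)
```

Both queries return the same rows; the difference lies in the number of joins the engine has to evaluate, which grows with the number of properties requested in the triple-table case.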
Related resources
Hash Tree Indexing for Fast SPARQL Query in Large Scale RDF Data Management Systems
Abstract. In the past decade, the volume of RDF (Resource Description Framework, which is a standard model for data interchange on the Web) data has grown enormously, and many RDF datasets (e.g., Wikipedia) have reached up to billions of triples. As a result, efficient management of this huge RDF data has become a tremendous challenge. In this paper, we present HTStore, a hash tree based system ...
Semiometrics: Applying Ontologies across Large-Scale Digital Libraries
As large-scale digital libraries become more available and complete, not to mention more numerous, it is clear there is a need for services that can draw together and perform inference calculations on the metadata produced. However, the traditional Relational Database Management System (RDBMS) model, while efficiently constructed and optimised for many business structures, does not necessarily ...
A Scalable Analysis Framework for Large-Scale RDF Data
With the growth of the Semantic Web, the availability of RDF datasets from multiple domains as Linked Data has taken the corpora of this web to a terabyte-scale, and challenges modern knowledge storage and discovery techniques. Research and engineering on RDF data management systems is a very active area with many standalone systems being introduced. However, as the size of RDF data increases, ...
Scalable Semantic Web Data Management Using Vertical Partitioning
Efficient management of RDF data is an important factor in realizing the Semantic Web vision. Performance and scalability issues are becoming increasingly pressing as Semantic Web technology is applied to real-world applications. In this paper, we examine the reasons why current data management solutions for RDF data scale poorly, and explore the fundamental scalability limitations of these app...
E2DR: Energy Efficient Data Replication in Data Grid
Abstract. Data grids are an important branch of grid computing which provide mechanisms for the management of large volumes of distributed data. Energy efficiency has recently emerged as a hot topic in large distributed systems. The development of computing systems is traditionally focused on performance improvements driven by the demand of clients' applications in scientific and business domai...